METHODS AND APPARATUS FOR SEMANTIC UNIT BASED AUTOMATIC
INDEXING AND SEARCHING IN DATA ARCHIVE SYSTEMS



Field of the Invention 

The present invention generally relates to data archive systems and, more particularly, 
to improved indexing and searching methods and apparatus for use in such systems.

Background of the Invention 

Several patents and patent applications deal with audio-indexing and searching of
audio data, e.g., U.S. Patent No. 5,649,060 issued to Ellozy et al. on July 15, 1997; U.S.
Patent No. 5,794,249 issued to Orsolini et al. on August 11, 1998; and U.S. patent
application identified by serial no. 09/108,544 (attorney docket no. Y0998-120), entitled
"Audio-Video Archive and Method for Automatic Indexing and Searching," filed on July
1, 1998, the disclosures of which are incorporated by reference herein. All of the approaches
taken in these patents and the patent application use a word as the basic unit for indexing and
search. Typically in these methods, audio data is transcribed (via automatic speech
recognition or manually), time stamped and indexed via words.

In a word-based system, before the searching can be started, a vocabulary and a 
language model based on known words must be prepared. Thus, by definition, there are 
always words that are unknown to the system. Unfortunately, the searching mechanism can 
only work with words resulting in a good language model score, i.e., known words. 

In an attempt to create a system capable of searching using an entry which is
unknown to the system, phone-based indexing methods have been proposed. Such methods
include generating an acoustic transcription for words and indexing speech segments via
acoustic phones. However, these phone-based indexing methods are not very efficient since
there can be different phonetic descriptions for the same word and the phonetic recognition
accuracy can be low, e.g., lower than word recognition accuracy.

These difficulties are even more apparent in a system operating in a language for
which the unit "word" in speech and text may be ambiguous, e.g., the Chinese language, or 
in a language that has a very large number of word forms, e.g., Slavic languages. 

For most European languages, word boundaries exist in printed text, as well as in 
computer text files. These boundaries are represented as blank spaces between words.
However, for most of the Asian languages, including, e.g., Chinese, Japanese, Korean, Thai,
and Vietnamese, such word boundaries exist neither in printed form nor in computer text
files. Thus, word-based indexing and searching methods cannot be applied to these
languages. Phone-based indexing and searching methods for these languages have
problems similar to those mentioned above.

Thus, a need exists for methods and apparatus for indexing and searching audio data,
and the like, which minimize and/or eliminate these and other deficiencies and limitations,
and which may be used with a greater number of languages.

Summary of the Invention 

The present invention provides for improved indexing and searching of audio data,
and the like, using minimal semantic unit based methodologies and/or apparatus. It is to be
appreciated that "minimal semantic units" are defined as small, preferably the smallest,
units of a language that are known to have semantic meaning. Examples of semantic units
that may be used are syllables or morphemes. Such an inventive approach may be used in
conjunction with languages which have difficulty being adapted for use with existing
approaches, e.g., Asian languages.

It is to be appreciated that a "morpheme" is a minimal semantic unit in a language
that is recurrent and meaningful. It may be a part of a word, or a word, such as the three
units in the word "friendliness," that is, "friend-," "li-," and "ness." In Western languages, there is
a distinction between a free morpheme and a bound morpheme. A free morpheme can be a
standalone word, such as "friend." A bound morpheme cannot be used by itself, such as "li"
and "ness." A morpheme can be a single syllable, a group of syllables, or a consonant
attached to a syllable, such as the "s" in "man's shirt." In most East Asian languages, since
there are no word boundaries in printed text or in computer files, the distinction between free
morpheme and bound morpheme is not explicit. In those languages, a morpheme is a more
adequate unit of language than a word.

Further, it is to be appreciated that a "syllable" is a group of phonemes comprising 
a vowel or continuant, alone or combined with a consonant or consonants, representing a 
complete articulation or a complex of articulations, and comprising the unit of word 
formation. It is identifiable with a chest pulse, and with a crest of sonority. A syllable can 
be open if it ends with a vowel, or closed if it ends with a consonant. In the above example,
"friend," "li," and "ness" are three syllables, with "li" open, and "friend" and "ness" closed.

The semantic unit known as a morpheme exists in many Asian languages. For 
example, in many East Asian languages, such as Chinese, Thai, Vietnamese, with a few 
exceptions, almost all morphemes are monosyllabic. Thus, in those languages, the concepts
of morpheme and syllable are interchangeable.

Also, in Chinese, each syllable is represented by a character, a so-called Hanzi. The 
number of syllables and the number of Hanzi are finite. In modern standard spoken Chinese,
Mandarin, the total number of different syllables is 1,400. In modern standard written
Chinese, the number of commonly used characters is 6,700 in mainland China, and 13,000
in Taiwan.

Accordingly, in a broad aspect of the present invention, methods and apparatus are 
provided for indexing and searching of audio data, and the like, which are based on minimal 
semantic units such as, for example, syllables and/or morphemes. In this manner, such 
inventive methods and apparatus for indexing and searching audio data, and the like, 
minimize and/or eliminate deficiencies and limitations associated with existing indexing and 
searching systems (e.g., word-based systems). Further, such inventive methods and 
apparatus for indexing and searching audio data, and the like, may be used with a greater 
number of languages.

Thus, in one exemplary embodiment of the invention for the Chinese language, a
searching engine may be provided which is based on characters, or Hanzi. A statistical
language model built upon a large text corpus is used to execute speech recognition. The
sought-after information (data to be searched) is formatted in terms of one character, or a
sequence of characters. The searching mechanism compares the text with the target.

In another exemplary embodiment of the invention for the Chinese language, a searching
engine may be provided which is based on phonetic syllables. A statistical language model
based on phonetic syllables is built from a large text corpus, by converting the characters into
phonetic syllables. The size of the language model is much smaller. The sought-after
information is formatted in terms of one phonetic syllable, or a sequence of phonetic
syllables.

Observing the fact that syllables in Chinese bear semantic information, we generalize
syllable based audio-indexing as follows. The present invention employs a semantic unit
that is typically smaller than a word and has a unique acoustic representation. Semantic units
allow the building of language models that represent semantic information and improve the
decoding accuracy of automatic speech recognition (ASR) that is based on a vocabulary
comprised of semantic units. As mentioned, examples of such units are a syllable (e.g., in
the Chinese language) or a morpheme (e.g., in Slavic languages) for transcription of audio data,
indexing and search. This methodology is generally applicable to most languages since the
unit syllable is clear, and the number of possible syllables in a language is finite. For those
languages, using the unit syllable as the basic building block of searching is more efficient.
This approach also resolves the above-mentioned problem of unknown words, since a system
employing the methodology knows all syllables that may be used in its applicable language.

For example, such languages that may be supported by this inventive approach may
include, but are not limited to:

a) Chinese. In the standard dialect (Mandarin, or Putonghua, based on the Beijing
dialect), the total number of allowed acoustic syllables is less than 1,800. The average
speech rate is 4-5 syllables per second.

b) Korean. There are fewer than 2,400 acoustically allowed syllables. The writing
system is totally based on acoustic syllables. The average speech rate is 4-5
syllables per second.

c) Japanese. There are only 105 acoustically allowed syllables. The average
speech rate is 6-7 syllables per second.

d) Vietnamese. There are 3,000 different syllables. The writing system is totally
based on acoustic syllables. The average speech rate is 4-5 syllables per
second.

Similarly, languages that have a very large number of word forms (like several
million word forms in Slavic languages) have a relatively small number of morphemes (e.g.,
50,000 morphemes in the Russian language). For those languages, an automatic speech
recognition system returns a string of acoustic syllables or morphemes. This can be done
with a language model based on acoustic syllables or morphemes. The word to be searched
is first rendered into a string of syllables. Those syllable strings are then matched against the
decoded acoustic syllable database.

It is to be appreciated that the methodologies of the present invention are more 
straightforward and faster than the word or word-tag based method. Data compression is 
also more efficient due to the finite number of syllables and morphemes. 

These and other objects, features and advantages of the present invention will become
apparent from the following detailed description of illustrative embodiments thereof, which
is to be read in connection with the accompanying drawings.

Brief Description of the Drawings 

FIG. 1 is a block diagram of an apparatus for indexing and searching an audio
recording via syllables according to an embodiment of the present invention;

FIG. 2 provides examples of searching queries and media according to an
embodiment of the present invention;

FIG. 3 is a block diagram of production of a syllable language model according to
an embodiment of the present invention;

FIG. 4A is a flow chart of a syllable based audio indexing method according to an
embodiment of the present invention;

FIG. 4B is a flow chart of a syllable based audio searching method according to an
embodiment of the present invention; and

FIG. 5 is a block diagram of a hardware implementation of an audio indexing and
searching system according to an embodiment of the present invention.

Detailed Description of Preferred Embodiments

The present invention will be explained below in the context of an illustrative syllable
based indexing and searching implementation. However, it is to be understood that the
present invention is not limited to such a particular implementation. Rather, the invention
is more generally applicable to indexing and searching of audio data using semantic units,
syllables being just one example of a semantic unit. For example, the invention
advantageously finds application in any implementation where it is desirable to provide
audio based data indexing and searching capabilities to a user such that the user does not
need to be concerned with entering unknown words in his query to the system. The
invention is particularly suitable for use with such languages as mentioned above, e.g., Asian
and Slavic languages. However, the invention is not limited to use with any particular
language.

Referring now to FIG. 1, apparatus for indexing and searching an audio recording
via syllables according to an embodiment of the present invention is shown. The apparatus
100 operates in the following manner. Audio data is recorded by an acoustic recorder unit
102. The audio data is stored in data storage 104. The audio data is also processed by a
syllable speech recognizer 106. An example of a speech recognizer that may be employed
by the invention is described in C.J. Chen et al., "A Continuous Speaker-Independent
Putonghua Dictation System," 3rd International Conference on Signal Processing
Proceedings, pp. 821-824, 1996, the disclosure of which is incorporated herein by
reference. A standard speech recognition system, such as that described in the above-referenced
Chen et al. article, can be adapted to use a syllable based language model 108,
the generation of which will be explained below, to provide the functions of the syllable
speech recognizer 106. Given a syllable based language model according to the invention
and given the fact that such a model is generally simpler than a word based language model
in a standard speech recognition system, one of ordinary skill in the art will appreciate how
to adapt a standard speech recognition system to operate as a syllable speech recognizer
106 using a syllable based language model 108.

It is to be appreciated that, in one embodiment of the invention, syllables may be
phonetically based. Phonetic syllables reflect different pronunciations of syllables. In
Chinese, phonetic syllables vary in different parts of the country (despite the fact that a
textual representation does not depend on a geographical location). In another embodiment
of the invention, phonetic syllables comprise "tonemes" that reflect phonetic and intonation
information; see the above-referenced Chen et al. article. A toneme is an intonation
phoneme in a tone language.

The syllable speech recognizer 106 using the syllable based language model 108,
in a similar manner as a standard speech recognition system uses a word based language
model, produces a decoded text (i.e., transcription) that is comprised of syllables 110. This
syllable text is time stamped, as will be explained, in unit 112 and stored with syllable
indexes in a syllable index storage unit 114. The syllable index storage unit 114 contains
indexes, e.g., time stamps, associated with the decoded syllable data. These time stamps,
as is explained in the example below, are used to retrieve the corresponding audio data in
the audio data storage 104 in response to a search query.

For example, in one preferred embodiment, an index stored in unit 114 contains the
address where the data for a syllable can be found in data storage 104. It is to be appreciated
that some syllables may occur several times during the recording of the audio data by the
acoustic recorder 102. The data from the recorder is stored in unit 104. An index in unit 114
points to where a particular syllable is stored in the unit 104. One simple way to indicate
where the syllable is stored is to indicate times when this syllable was spoken. Thus, the
index can be related to a set of times when the syllable was spoken. This is accomplished
via the time stamping of the syllables by unit 112. The conversion of time into a storage address
allows retrieval of all locations where the data related to the syllable is stored in unit 104.

By way of a simple example, assume that a sentence decoded by the syllable speech
recognizer 106 contains a sequence of syllables that are aligned to an audio sentence (stored in unit
104), whereby the sequence of syllables is represented as: S1, S2, S3, S4, S1, S4,
S1, S2, S7, S8, S7. These syllables have been time stamped for times: t1-t2, t3-t4, t5-t6,
..., tm-tn. Assume the audio sentence is represented as audio segments: aud1, aud2, aud3, aud4,
aud5, aud6, aud7, aud8, aud9, aud10, aud11. Thus, the index data stored in unit 114 may be
as follows: S1: aud1, aud5, aud6; S2: aud2, aud8; S3: aud3; S4: aud4, aud6; S7: aud9, aud11;
S8: aud10. This means that a syllable S1 is stored in 1st, 5th and 6th places (segments) in
the audio sentence that is stored in the data storage 104. Therefore, to play segments that
correspond to S1, one can go to corresponding locations in data storage 104 that are indicated
in the index.
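
The index structure described above can be sketched in a few lines of Python. This is a minimal illustration only, not the implementation of unit 112 or unit 114: the syllable sequence, the time values and the name build_index are assumptions chosen for the sketch, and a storage address is simplified to a (start, end) time pair.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (syllable, start_time, end_time) triples as a recognizer might emit them.
# The sequence and the times are illustrative, not taken from the figures.
decoded: List[Tuple[str, float, float]] = [
    ("S1", 0.0, 0.2),
    ("S2", 0.2, 0.5),
    ("S3", 0.5, 0.7),
    ("S1", 0.7, 0.9),
    ("S2", 0.9, 1.2),
]

def build_index(units: List[Tuple[str, float, float]]) -> Dict[str, List[Tuple[float, float]]]:
    """Map each syllable to the list of time intervals in which it was spoken."""
    index: Dict[str, List[Tuple[float, float]]] = defaultdict(list)
    for syllable, start, end in units:
        index[syllable].append((start, end))
    return dict(index)

syllable_index = build_index(decoded)
# Looking up "S1" yields the intervals to fetch from audio storage.
s1_segments = syllable_index["S1"]
```

Here each syllable maps to every interval in which it occurs, so a syllable spoken several times yields several entries, in spoken order.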

It is to be appreciated that while time stamping is a convenient way to index the 
decoded data, any other applicable indexing technique may be employed. The above 
process generally comprises the data indexing process according to this particular
embodiment of the invention.

The syllable index storage unit 114 is connected to a syllable based search device
116. Any conventional search methodology may be employed by the search device. The
syllable search device 116 receives as input queries 118 from a user 124 via input device
122. The input device may, for example, be one of the following: a keyboard; an
automatic speech recognition (ASR) system; an automatic handwriting recognition (AHR)
system, etc. The syllable query may be processed by query processing module 120, as will
be explained, prior to being submitted to the search device. The syllable query 118 is used
by the search device 116 to identify audio segments in the data storage 104. This may be
accomplished by the audio segments being aligned to textual data. For example, audio data
is parameterized by time, and syllables in the sentence are mapped to time intervals in the audio
data. An example of this was given above with regard to the indexing operation. The
technique of aligning audio data to textual data is performed by the recognizer 106. When
the recognizer decodes speech, it associates textual parts (e.g., syllables) to corresponding
pieces of audio data.

Thus, a syllable in the user's query may be associated with or matched to one or
more audio segments stored in audio storage 104 by identifying the index in index storage
114 that corresponds to the syllable in the query. That is, if the user query contains
syllable S1, then audio segments aud1, aud5, aud6 are identified based on the indexing
operation explained above. Once the audio data segments are identified, they are played
back to the user via a playback/output device 126. The device 126 may therefore include
a play back speaker. The user query 118 can contain additional information that helps to
localize the search.

The above scheme is a simplified example of audio indexing/searching via
syllables. That is, depending on the application, additional features can be implemented.
Namely, the audio data may be further indexed based on attributes associated with the
person who generated the audio data, i.e., the speaker. This may be accomplished in
indexer and storage unit 128. That is, attributes associated with the speaker, e.g., name,
sex, age, may be extracted from the audio data and used to index and store the audio data
provided. These attributes may be expressly spoken by the person (e.g., "My name is ...")
and decoded by a speech recognizer or determined via conventional speaker recognition
techniques. Alternatively, the audio data can be labeled with speaker names in order to
enhance the audio search portion of the system. Labeling audio data with speaker names
is discussed in the U.S. patent application identified by serial no. 09/294,214 (attorney
docket no. Y0998-398), entitled "System and Method for Indexing and Querying Audio
Archives," filed on April 16, 1999, the disclosure of which is incorporated herein by
reference.

Accordingly, for example, the user can restrict a search of a given syllable query
to some speakers in a conference. As mentioned, the stored audio data can be also
associated with speaker biometrics that provide additional information about speakers (for
example, social user status, age, sex, etc.) as is described in the U.S. patent application
identified by serial no. 09/371,400 (attorney docket no. Y0999-227), entitled
"Conversational Data Mining," filed on August 10, 1999, the disclosure of which is
incorporated herein by reference.
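
Restricting a syllable search to particular speakers can be illustrated with the following Python sketch. It assumes each indexed segment already carries a speaker label. The Segment record, the search function and the speaker names are hypothetical and serve only to show the filtering step, not how indexer and storage unit 128 is built.

```python
from dataclasses import dataclass
from typing import List, Optional, Set

@dataclass(frozen=True)
class Segment:
    syllable: str
    start: float      # seconds from the start of the recording
    end: float
    speaker: str      # label from speaker recognition or manual tagging (assumed)

def search(segments: List[Segment], syllable: str,
           speakers: Optional[Set[str]] = None) -> List[Segment]:
    """Return segments matching the syllable, optionally limited to given speakers."""
    return [
        seg for seg in segments
        if seg.syllable == syllable and (speakers is None or seg.speaker in speakers)
    ]

# Illustrative archive with two speakers.
archive = [
    Segment("S1", 0.0, 0.2, "alice"),
    Segment("S2", 0.2, 0.5, "alice"),
    Segment("S1", 0.5, 0.7, "bob"),
]
only_bob = search(archive, "S1", speakers={"bob"})
```

Other labels mentioned in the text, such as age or the place of recording, could be handled the same way by adding fields to the record and conditions to the filter.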

Stored audio data can also be marked with labels providing some other information. 
This information can include when the audio data was produced, places where
it was produced, etc. The audio data can also be associated with video data that was
recorded simultaneously with the audio data and stored in data storage unit 104. This 
permits a user to add video related queries to the audio related queries he enters at the input 
device 122. In this case, the search device may further implement video image recognition 
searching techniques. 

It is to be appreciated that one or more of these additional indexing features (e.g.,
speaker biometrics, video data, etc.) may be implemented in accordance with the apparatus
100 of FIG. 1 in the indexer and storage unit 128. In the case of indexing and storing both
audio and video data, the hierarchical index storage and searching techniques as described
in the above-incorporated U.S. patent application identified by serial no. 09/108,544
(attorney docket no. Y0998-120), entitled "Audio/video Archive and Method for
Automatic Indexing and Searching," filed on July 1, 1998, may be employed. In the hierarchical search,
the syllable becomes one of the layers in the hierarchical pyramid. As will be explained
below, FIG. 2 depicts certain of these additional indexing and searching features which
apparatus 100 may implement.

Results of a user query search can be presented to the user in various other ways
than explained above. For example, in accordance with a playback output device 126 that
includes a display, the user can first view a printed decoded (syllable) output and, after
viewing the whole decoding output, the user 124 can decide what part of audio data he
would like to play back simply by clicking (using a mouse included as part of the input
device 122) on this part of textual output. In another embodiment, the user also can view
video data that is associated with audio data that was found by the searching device 116
in accordance with query requests.

In yet another embodiment of the invention, the audio data is played back starting 
from the syllable that was indicated by the user query until the user stops the audio play 
back (via the input device) or until a particular time duration of the audio segment, as 
specified in the user query, has expired. 

Still further, the user query can also consist of words rather than a set of phonetic
syllables. In this case, words are transformed into a sequence of syllables using a text-to-phonetic
syllable map. Such a map may be generated in any conventional manner. This
text-to-syllable map can employ a table that associates, with each syllable, a set of possible
phonetic syllables. This map/table may be implemented by the query processing module
120. In the search mode, the number of phonetic syllables associated with an input textual
syllable can be restricted if additional data is provided (for example, geographical location
where audio data was produced).
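
A table-driven map of this kind might look like the following Python sketch. The table contents, the region names and the function names are all assumptions made for illustration. The phonetic variants shown are invented placeholders rather than documented pronunciations.

```python
from typing import Dict, List, Optional

# Hypothetical table: textual syllable -> {region: [phonetic variants]}.
PHONETIC_MAP: Dict[str, Dict[str, List[str]]] = {
    "shi": {"north": ["shi"], "south": ["si", "shi"]},
    "ren": {"north": ["ren"], "south": ["len", "ren"]},
}

def phonetic_candidates(syllable: str, region: Optional[str] = None) -> List[str]:
    """Return the phonetic syllables for one textual syllable.

    Without a region, all known variants are returned (duplicates removed,
    first-seen order kept); with a region, only that region's variants.
    """
    by_region = PHONETIC_MAP[syllable]
    if region is not None:
        return list(by_region[region])
    seen: List[str] = []
    for variants in by_region.values():
        for v in variants:
            if v not in seen:
                seen.append(v)
    return seen

def expand_query(syllables: List[str], region: Optional[str] = None) -> List[List[str]]:
    """Expand each textual syllable of a query into its candidate phonetic syllables."""
    return [phonetic_candidates(s, region) for s in syllables]

unrestricted = expand_query(["shi", "ren"])
restricted = expand_query(["shi", "ren"], region="north")
```

Supplying the region shrinks each candidate set, which reduces the number of phonetic syllable strings the search device has to try.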

The user query also can contain a relatively long textual corpus rather than several
words or syllables. The user can have a text of spoken speech (for example, if he himself
read some text to record audio data). In this case, the textual corpus is mapped into a string
of (phonetic) syllables and a specific search mechanism implemented in the search device
116 can be used to find audio data that match a long string of syllables. This mechanism
is described in the above-incorporated U.S. Patent No. 5,649,060. It allows matching of audio
data with a reference textual corpus even when relatively low quality ASR is used. It
exploits time stamping of a textual corpus and matches a small number of portions in the
reference script with portions in the stored decoded output.

The methods that are used for phonetic syllable indexing can be used with other
techniques of splitting of words into smaller units, for example, morphemes in Slavic
languages.

Referring now to FIG. 2, examples of searching queries and media according to an
embodiment of the present invention are depicted. Media for search 202 can contain both
audio data 204 and video data 206. The media is split into units 208 used for indexing.
It is to be appreciated that this splitting may be done in the query processing module 120.
Examples of audio units are depicted in block 210. Such units may include: text portions
(e.g., phrases, paragraphs, chapters, poems, stories), words, syllables, phonetic syllables,
morphemes, characters, and other semantic units (e.g., roots in Slavic languages). Video
data can be split into video portions 212. This can also be done in the query processing
module 120; see the above-referenced U.S. patent application identified by serial no.
09/108,544 (attorney docket no. Y0998-120), entitled "Audio/video Archive and Method
for Automatic Indexing and Searching."

The searching device 116 (same as in FIG. 1), in response to receipt of query units
208, can employ one or more of the features depicted in block 214 to assist or produce a
search: (i) hierarchical indexing (e.g., phonetic syllables point to syllables, syllables point
to words and words point to phrases); (ii) labeling used to restrict a search (e.g.,
location, speaker names, time period, etc.); (iii) time stamping, which helps to index audio data and
align it to textual data; and (iv) a language unit model, which is trained from a string of units (e.g.,
syllables) and increases the accuracy of mapping audio data into a string of units (e.g.,
syllables).

The search system 116 may also use an automatic boundary marking system that is
applied to a query 118. This is used to split the user input into words. Recall that in some
languages characters are not separated into words with spaces. This allows searching via
words (not only via syllables). Found portions of audio (e.g., that correspond to syllables or
words) are played via unit 126 to the user so that he can decide which portion of audio is
needed.
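
The text does not specify how the boundary marking system works. As one possible illustration, the Python sketch below uses greedy forward maximum matching against a lexicon, a common baseline for segmenting unspaced text. The technique, the function name segment and the toy lexicon are assumptions, not the method of the invention.

```python
from typing import List, Set

def segment(text: str, lexicon: Set[str], max_len: int = 4) -> List[str]:
    """Split unspaced text into words by greedy forward maximum matching.

    At each position the longest lexicon entry is taken; if none matches,
    a single character is emitted so that the scan always advances.
    """
    words: List[str] = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words

# Toy lexicon (assumed); "北京" = Beijing, "大学" = university,
# "北京大学" = Peking University.
lexicon = {"北京", "大学", "北京大学"}
tokens = segment("我在北京大学", lexicon)
```

Greedy matching prefers the longest entry, so the four-character name is kept whole rather than split into two shorter words. A production system would typically use a statistical segmenter instead.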

As mentioned, the syllables can point to other hierarchical levels of data (e.g., as
described in block 214 of FIG. 2). For example, audio can be accompanied with video and
therefore this video data can be shown to the user (e.g., via unit 126) together with audio.

Audio data can be represented with cepstra (i.e., an efficient compressed form of
representation of audio). The cepstra can be converted to audio data that can be played to
a user. The quality of the audio data obtained from cepstra can be relatively low but may be
suitable in some applications, e.g., just to represent a content of the stored phrase. Since
cepstra require less storage capacity than full audio, the search and play back can be performed
faster. The cepstra can point to full quality audio that can be used if the user needs a high
quality output. Such an interface is further described in the above-incorporated U.S. patent
application identified by serial no. 09/108,544 (attorney docket no. Y0998-120), entitled
"Audio/video Archive and Method for Automatic Indexing and Searching."

In another embodiment, a textual output can also be represented as a stenographer transcription
(i.e., rather than a decoder output). Stenography is similar to a decoder, but textual data is
produced by a stenographer and can be more accurate than a decoder output. This
stenographer output can be presented to the user 124 via unit 126, if this stenographer output
is available. Therefore, a user can point to different places in the stenographer output and
they will be played back as audio that is aligned to the stenographer data.

Referring to FIG. 3, a block diagram of a method of producing a syllable language
model according to an embodiment of the present invention is shown. This is the syllable
language model 108 that may be used by speech recognizer 106 of FIG. 1. Textual corpora
300 is used to produce strings of syllables 302 (e.g., via tables that map strings of
characters into syllables). Strings of syllables give rise to syllable counts 304. In order to
produce a language model of phonetic syllables 306, it is necessary to know how syllables
are pronounced. Since the same syllables can have different pronunciations, this data
cannot be extracted directly from a textual corpus. As a result, the audio data 308
corresponding to the text 300 is transcribed (block 310). Transcription 310 may be
generated manually or using automatic speech recognition that aligns phonetic syllables
to a string of spoken syllables. Phonetic syllables 312 and syllables 314 generated as part
of the transcription 310 are then used to derive respective probabilities of distribution of
a phonetic syllable given a syllable (block 316). Syllable counts 304 and conditional
distributions of phonetic syllables 314 are used to construct the language model of phonetic
syllables 306. Given the syllable counts 304 and the conditional distributions of phonetic
syllables 314, one of ordinary skill in the art will appreciate how to construct the language
model of phonetic syllables 306. For example, the procedure is similar to constructing a
language model for classes (e.g., Frederick Jelinek, "Statistical Methods for Speech
Recognition," The MIT Press, Cambridge, 1998, the disclosure of which is incorporated
herein by reference) or a language model for morphemes (e.g., U.S. Patent No. 5,835,888
issued November 10, 1998, entitled "Statistical Language Model for Inflected Languages,"
the disclosure of which is incorporated herein by reference).
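
One way to combine the two ingredients, shown here only at the unigram level and only as a sketch under stated assumptions, is to weight each conditional distribution by the relative frequency of its syllable: P(phonetic) = sum over syllables s of P(s) * P(phonetic | s). The function names and the toy data below are hypothetical. A practical model would also cover n-gram context and smoothing, which this sketch omits.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

def syllable_probs(syllable_strings: Iterable[List[str]]) -> Dict[str, float]:
    """Relative frequencies of syllables counted over the text corpus (cf. counts 304)."""
    counts = Counter(s for line in syllable_strings for s in line)
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}

def conditional_probs(aligned: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, float]]:
    """Estimate P(phonetic | syllable) from aligned (syllable, phonetic) pairs."""
    pair_counts: Dict[str, Counter] = defaultdict(Counter)
    for syllable, phonetic in aligned:
        pair_counts[syllable][phonetic] += 1
    return {
        s: {p: c / sum(cnt.values()) for p, c in cnt.items()}
        for s, cnt in pair_counts.items()
    }

def phonetic_unigram(p_syl: Dict[str, float],
                     p_cond: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Combine both tables: P(phonetic) = sum_s P(s) * P(phonetic | s)."""
    model: Dict[str, float] = defaultdict(float)
    for s, ps in p_syl.items():
        for p, pc in p_cond.get(s, {}).items():
            model[p] += ps * pc
    return dict(model)

# Illustrative data: text corpus as syllable strings, transcription as aligned pairs.
corpus = [["a", "b", "a"], ["b"]]
alignment = [("a", "a1"), ("a", "a2"), ("b", "b1"), ("b", "b1")]
lm = phonetic_unigram(syllable_probs(corpus), conditional_probs(alignment))
```

The syllable frequencies come from text alone, while the pronunciation statistics come from the transcribed audio, which mirrors the two branches of FIG. 3.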

Referring now to FIG. 4A, a flow chart of a syllable based audio indexing method
according to an embodiment of the present invention is shown. In step 400, audio data to be indexed
and stored is recorded. In step 402, the audio data is decoded into a transcription
comprising strings of syllables (or morphemes). In step 404, the syllables are indexed by
time stamping the syllables (or morphemes). Lastly, in step 406, the syllables (or
morphemes) are stored in accordance with the time stamp indexes.

Referring now to FIG. 4B, a flow chart is shown of a syllable based audio searching method 
according to an embodiment of the present invention. It is to be appreciated that the search 
method of FIG. 4B is preferably employed in connection with data indexed according to the 
indexing method of FIG. 4A. In step 408, a user enters a query in order to retrieve some 
portion of the stored acoustic data. The query is processed in step 410. As explained above, 
this may include transforming words entered by a user into a sequence of syllables using 
a text-to-phonetic syllable map. The user may also directly enter syllables rather than 
words. In step 412, the syllables are used to retrieve the desired audio data segments from 
storage. Lastly, in step 414, the audio segments are played back to the user. 
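
The query processing of step 410 and the retrieval of step 412 can be sketched as follows. 
This is a minimal illustration under stated assumptions: the map WORD_TO_SYLLABLES, the 
function names and the stored transcription are hypothetical, and the sketch requires an 
exact, consecutive syllable match, which is a simplification. Playback (step 414) is omitted. 

```python
from typing import Dict, List, Tuple

TimedSyllable = Tuple[str, float, float]  # (syllable, start s, end s)

# Hypothetical text-to-phonetic-syllable map used in step 410.
WORD_TO_SYLLABLES: Dict[str, List[str]] = {
    "nihao": ["ni", "hao"],
    "xiexie": ["xie", "xie"],
}


def query_to_syllables(query: str) -> List[str]:
    """Expand known words via the map; treat any other token as a syllable typed directly."""
    syllables: List[str] = []
    for token in query.lower().split():
        syllables.extend(WORD_TO_SYLLABLES.get(token, [token]))
    return syllables


def find_segments(transcript: List[TimedSyllable],
                  target: List[str]) -> List[Tuple[float, float]]:
    """Return (start, end) spans where the target syllables occur consecutively."""
    n = len(target)
    if n == 0:
        return []
    names = [s for s, _, _ in transcript]
    return [(transcript[i][1], transcript[i + n - 1][2])
            for i in range(len(names) - n + 1)
            if names[i:i + n] == target]


# Hypothetical stored transcription produced by an earlier indexing pass.
transcript = [("ni", 0.0, 0.2), ("hao", 0.2, 0.5), ("xie", 0.9, 1.1),
              ("xie", 1.1, 1.3), ("ni", 2.0, 2.2), ("hao", 2.2, 2.6)]
hits = find_segments(transcript, query_to_syllables("nihao"))
```

Each returned span gives the start and end time of a matching audio segment, which is the 
information needed to fetch that segment from storage for playback. 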

Referring now to FIG. 5, a block diagram is shown of an exemplary hardware 
architecture for implementing one, more or all of the elements of the apparatus 100 shown 
in FIG. 1. In this embodiment, the apparatus 100 may be implemented by a processor 500, 
memory 502, and I/O devices 504. It is to be appreciated that the term "processor" as used 
herein is intended to include any processing device, such as, for example, one that includes 
a CPU (central processing unit). For example, the processor may be a digital signal 
processor, as is known in the art. Also the term "processor" may refer to one or more 
individual processors. The term "memory" as used herein is intended to include memory 
associated with a processor or CPU, such as, for example, RAM, ROM, a fixed memory 
device (e.g., hard drive), a removable memory device (e.g., diskette), flash memory, etc. In 
addition, the term "input/output devices" or "I/O devices" as used herein is intended to 
generally include, for example, one or more input devices, e.g., microphone, keyboard, 
mouse, etc., for inputting data and other signals to the processing unit, and/or one or more 
output devices, e.g., display, speaker, etc., for providing results associated with the 
processing unit. For example, the display or speaker may provide a user with play back 
information retrieved by the system. Accordingly, computer software including instructions 
or code for performing the methodologies of the invention, as described herein, may be 
stored in one or more of the associated memory devices (e.g., ROM, fixed or removable 
memory) and, when ready to be utilized, loaded in part or in whole (e.g., into RAM) and 
executed by a CPU. In any case, it should be understood that the elements illustrated in the 
figures may be implemented in various forms of hardware, software, or combinations 
thereof, e.g., one or more digital signal processors with associated memory, application 
specific integrated circuit(s), functional circuitry, one or more appropriately programmed 
general purpose digital computers with associated memory, etc. Given the teachings of the 
invention provided herein, one of ordinary skill in the related art will be able to contemplate 
other implementations of the elements of the invention. 

Although illustrative embodiments of the present invention have been described 
herein with reference to the accompanying drawings, it is to be understood that the invention 
is not limited to those precise embodiments, and that various other changes and 
modifications may be effected therein by one skilled in the art without departing from the 
scope or spirit of the invention. 

Claims 

What is claimed is: 

1. A method of processing audio-based data associated with a particular language, 
the method comprising the steps of: 

    storing the audio-based data; 

    generating a textual representation of the audio-based data, the textual 
representation being in the form of one or more semantic units corresponding to the 
audio-based data; and 

    indexing the one or more semantic units and storing the one or more indexed 
semantic units for use in searching the stored audio-based data in response to a user query. 

2. The method of claim 1, wherein the semantic unit is a syllable. 

3. The method of claim 1, wherein the syllable is a phonetically based syllable. 

4. The method of claim 1, wherein the semantic unit is a morpheme. 

5. The method of claim 1, wherein the generating step comprises decoding the 
audio-based data in accordance with a speech recognition system. 

6. The method of claim 5, wherein the speech recognition system employs a 
semantic unit based language model. 

7. The method of claim 1, wherein the indexing step comprises time stamping the 
one or more semantic units. 

8. The method of claim 1, wherein the searching step comprises: 

    processing the user query to generate one or more semantic units representing the 
information that the user seeks to retrieve; 

    searching the one or more indexed semantic units to find a substantial match with 
the one or more semantic units associated with the user query; and 

    retrieving one or more segments of the audio-based data using the one or more 
indexed semantic units that match the one or more semantic units associated with the user 
query. 

9. The method of claim 8, wherein the searching step further comprises presenting 
the retrieved data to the user. 

10. The method of claim 1, wherein the particular language is an Asian based 
language. 

11. The method of claim 10, wherein the particular language is Chinese. 

12. The method of claim 11, wherein the semantic unit is a Chinese character. 

13. The method of claim 1, wherein the particular language is a Slavic based 
language. 

14. The method of claim 1, wherein the one or more semantic units are indexed 
according to speaker attributes. 

15. The method of claim 1, wherein the one or more semantic units are indexed 
according to at least one of when the audio based data was produced and where the audio 
based data was produced. 

16. The method of claim 1, further comprising the step of storing video based data 
associated with the audio based data for use in searching the stored audio based data and 
the video based data in response to a user query. 

17. The method of claim 16, wherein the searching step includes a hierarchical 
search routine. 

18. The method of claim 1, wherein the generating step comprises stenographically 
transcribing the audio-based data to generate the textual representation. 

19. Apparatus for processing audio-based data associated with a particular 
language, the apparatus comprising: 

    at least one processor operative to: (i) store the audio-based data; (ii) generate a 
textual representation of the audio-based data, the textual representation being in the form 
of one or more semantic units corresponding to the audio-based data; and (iii) index the 
one or more semantic units and store the one or more indexed semantic units for use in 
searching the stored audio-based data in response to a user query. 

20. An audio-based data indexing and retrieval system for processing audio-based 
data associated with a particular language, the system comprising: 

    memory for storing the audio-based data; 

    a semantic unit based speech recognition system for generating a textual 
representation of the audio-based data, the textual representation being in the form of one 
or more semantic units corresponding to the audio-based data; 

    an indexing and storage module, operatively coupled to the semantic unit based 
speech recognition system and the memory, for indexing the one or more semantic units 
and storing the one or more indexed semantic units; and 

    a search engine, operatively coupled to the indexing and storage module and the 
memory, for searching the one or more indexed semantic units for a match with one or more 
semantic units associated with a user query, and for retrieving the stored audio based data 
based on the one or more indexed semantic units. 


Abstract of the Disclosure 

An audio-based data indexing and retrieval system for processing audio-based data 
associated with a particular language, comprising: (i) memory for storing the audio-based 
data; (ii) a semantic unit based speech recognition system for generating a textual 
representation of the audio-based data, the textual representation being in the form of one 
or more semantic units corresponding to the audio-based data; (iii) an indexing and storage 
module, operatively coupled to the semantic unit based speech recognition system and the 
memory, for indexing the one or more semantic units and storing the one or more indexed 
semantic units; and (iv) a search engine, operatively coupled to the indexing and storage 
module and the memory, for searching the one or more indexed semantic units for a match 
with one or more semantic units associated with a user query, and for retrieving the stored 
audio based data based on the one or more indexed semantic units. The semantic unit may 
preferably be a syllable or morpheme. Further, the invention is particularly well suited for 
use with Asian and Slavic languages. 


Attorney Docket No. Y0999-426 

DECLARATION 

AS A BELOW NAMED INVENTOR, I hereby declare that: 

My residence, post office address and citizenship are as stated next to my name. 

I believe that I am the original, first and sole inventor (if only one name is listed below), or an 
original, first and joint inventor (if plural names are listed below), of the subject matter which 
is claimed and for which a patent is sought on the invention entitled: 

TITLE: METHODS AND APPARATUS FOR SEMANTIC UNIT BASED AUTOMATIC INDEXING AND 
SEARCHING IN DATA ARCHIVE SYSTEMS 

the specification of which is attached hereto or indicates an attorney docket no. Y0999-426, or 
□ was filed in the U.S. Patent & Trademark Office on ________ and assigned Serial No. ________ 
□ and (if applicable) was amended on ________. 



I hereby state that I have reviewed and understand the contents of the above-identified 
specification, including the claims, as amended by any amendment referred to above. I 
acknowledge the duty to disclose information which is material to patentability and to the 
examination of this application in accordance with Title 37, Code of Federal Regulations 
§1.56. I hereby claim foreign priority benefits under Title 35, U.S. Code §119(a)-(d) or 
§365(b) of any foreign application(s) for patent or inventor's certificate, or §365(a) of any 
PCT international application which designated at least one country other than the United 
States, or §119(e) of any United States provisional application(s), listed below and have also 
identified below any foreign applications for patent or inventor's certificate having a filing 
date before that of the application on which priority is claimed: 

Priority Claimed: 

(Application Number)    (Country)    (Day/Month/Year filed)    Yes [ ] No [ ] 
(Application Number)    (Country)    (Day/Month/Year filed)    Yes [ ] No [ ] 



I hereby claim the benefit under Title 35, U.S. Code §120, of any United States application(s), 
or §365(c) of any PCT international application designating the United States, listed below 
and, insofar as the subject matter of each of the claims of this application is not disclosed in 
the prior United States or PCT international application(s) in the manner provided by the 
first paragraph of Title 35, U.S. Code §112, I acknowledge the duty to disclose information 
material to patentability as defined in Title 37, Code of Federal Regulations §1.56 which 
became available between the filing date of the prior application and the national or PCT 
international filing date of this application: 



(Application Serial Number)    (Filing Date)    (STATUS: patented, pending, abandoned) 
(Application Serial Number)    (Filing Date)    (STATUS: patented, pending, abandoned) 



J bwrefay ^ m<M\v% attwww; MANNY W- SCaSCTjeR. No. 3 1,722; TERRY X ILAiUM, Re^. No. 

IX>U<aLAS W, CA2K3SRON> No- 5U%; «TEP1£KV C KAUFMAN, Itej No. 2?^5J; JAY F. SBIROlXim, 
m?^«»OGKN, No, 43,502; ROBW M 1V3E?F. R*^ >^fx 25^33; LOUIS r. Wmm&m, R«g, Na 4i,SoO; 
33^06; PAIXI. 1. OTTCRSTEDT, No. 37,4? PAW^SX MORR2$, Reg. No, ^2,05^^ ©AV1ED iwU jgRQlRi, jftje^ 
JJOl^RNATl WAL BliSEHESS MACHINES COKPORAHW, mmas I Watson Research Ccota-, ? O. Box 2 Vvm^vm H^?s, Kt 
prosecute tiiis ^licaboa aa4 td ttaasact aU busiocs* ift Patent ard T^^lejrearit Oi^e eonrwctcd tibm^i^ «»d wi^ 

C()ntiTTu^6iwn-part, rdsstic w re-«xartunJWSQn app^jcaiiosft, wjth M ?«wr *f sppoimmwii ajjd wvtb Ml j>ow«3r to ^KbstjM?? m 
io receive aU pafiattt mfty tiwsneoft, Aftci retiuesi ijjgt all cor«spoa<JeftCe lie addressed to; 



each of tbcin of 

<iivtsio!i:i4}, ix>tt^uatk>;3^ 
attorney or «a4 



i any 



William Lewis 
RYAN & MASON, L.L.P. 
90 Forest Avenue 
Locust Valley, NY 11560 
Tel.: (516) 759-2946 




I HEREBY DECLARE that all statements made herein of my own knowledge are true and that all 
statements on information and belief are believed to be true; and further that these 
statements were made with the knowledge that willful false statements and the like so made 
are punishable by fine or imprisonment, or both, under §1001 of Title 18 U.S. Code and that 
such willful false statements may jeopardize the validity of the application or any patent 
issued thereon. 



FULL NAME OF FIRST OR SOLE INVENTOR: Chengjun Julian Chen 

Inventor's signature: 
Residence & Post Office address: Hawthorne Street, White Plains, New York 10603 
Citizenship: 
Date: 

FULL NAME OF SECOND JOINT INVENTOR: Dimitri Kanevsky 

Inventor's signature: 
Residence & Post Office address: 1 35S Spring Valley Road, Ossining, New York 10562 
Citizenship: U.S.A. 
Date: 




Attorney Docket No. Y0999-426 

IN THE UNITED STATES PATENT AND TRADEMARK OFFICE 

APPLICANT(S): Chengjun Julian Chen et al. 
SERIAL NO.: 
FILED: Concurrently Herewith 
FOR: METHODS AND APPARATUS FOR SEMANTIC UNIT BASED 
AUTOMATIC INDEXING AND SEARCHING IN DATA ARCHIVE 
SYSTEMS 

Please recognize JOSEPH B. RYAN, Reg. No. 37,922; KEVIN M. MASON, Reg. No. 
36,597; and WILLIAM E. LEWIS, Reg. No. 39,274; each of them of RYAN & MASON, 
L.L.P., 90 Forest Avenue, Locust Valley, New York 11560 as associate attorneys in the 
above-mentioned application, with full power to prosecute said application, to make 
alterations and amendments therein, and to transact all business in the Patent and 
Trademark Office connected therewith. 

Telephone calls should be made to William E. Lewis by dialing (516) 759-2946. 

All written communications are to be sent to William E. Lewis, Esq., Ryan & 
Mason, L.L.P., 90 Forest Avenue, Locust Valley, New York 11560. 



Dated: 

Attorney for Applicant(s) 

International Business Machines Corporation 
T.J. Watson Research Center 
Route 134 and Kitchawan Road 
Yorktown Heights, New York 10598 